Computer-Implemented System And Method For Providing Classification Suggestions

ABSTRACT

A computer-implemented system and method for providing classification suggestions is provided. A set of uncoded documents is maintained. One of the uncoded documents is selected and compared with a set of reference documents, each associated with a classification. Those reference documents that are similar to the uncoded document are identified. Relationships between the uncoded document and each reference document are identified by counting a number of similar reference documents associated with each different classification. The classification having a highest count of similar reference documents is selected for the selected uncoded document as a suggestion.

CROSS-REFERENCE TO RELATED APPLICATION

This patent application is a continuation of commonly-assigned U.S. patent application Ser. No. 14/065,364, filed on Oct. 28, 2013, pending; which is a continuation of U.S. Pat. No. 8,572,084, issued Oct. 29, 2013; which claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application, Serial No. 61/229,216, filed July 28, 2009, and U.S. Provisional Patent Application, Ser. No. 61/236,490, filed Aug. 24, 2009, the priority dates of which are claimed and the disclosures of which are incorporated by reference.

FIELD

This application relates in general to using documents as a reference point and, in particular, to a system and method for providing classification suggestions.

BACKGROUND

Historically, document review during the discovery phase of litigation and for other types of legal matters, such as due diligence and regulatory compliance, has been conducted manually. During document review, individual reviewers, generally licensed attorneys, are assigned sets of documents for coding. A reviewer must carefully study each document and categorize the document by assigning a code or other marker from a set of descriptive classifications, such as “privileged,” “responsive,” and “non-responsive.” The classifications can affect the disposition of each document, including admissibility into evidence.

During discovery, document review can potentially affect the outcome of the underlying legal matter, so consistent and accurate results are crucial. Manual document review is tedious and time-consuming. Marking documents is solely at the discretion of each reviewer and inconsistent results may occur due to misunderstanding, time pressures, fatigue, or other factors. A large volume of documents reviewed, often with only limited time, can create a loss of mental focus and a loss of purpose for the resultant classification. Each new reviewer also faces a steep learning curve to become familiar with the legal matter, classification categories, and review techniques.

Currently, with the increasingly widespread movement to electronically stored information (ESI), manual document review is no longer practicable. The often exponential growth of ESI exceeds the bounds reasonable for conventional manual human document review and underscores the need for computer-assisted ESI review tools.

Conventional ESI review tools have proven inadequate to providing efficient, accurate, and consistent results. For example, DiscoverReady LLC, a Delaware limited liability company, custom programs ESI review tools, which conduct semi-automated document review through multiple passes over a document set in ESI form. During the first pass, documents are grouped by category and basic codes are assigned. Subsequent passes refine and further assign codings. Multiple pass review requires a priori project-specific knowledge engineering, which is only useful for the single project, thereby losing the benefit of any inferred knowledge or know-how for use in other review projects.

Thus, there remains a need for a system and method for increasing the efficiency of document review that bootstraps knowledge gained from other reviews while ultimately ensuring independent reviewer discretion.

SUMMARY

Document review efficiency can be increased by identifying relationships between reference ESI and uncoded ESI, and providing a suggestion for classification based on the relationships. The uncoded ESI for a document review project are identified and clustered. At least one of the uncoded ESI is selected from the clusters and compared with the reference ESI based on a similarity metric. The reference ESI most similar to the selected uncoded ESI are identified. Classification codes assigned to the similar reference ESI can be used to provide suggestions for classification of the selected uncoded ESI. Further, a machine-generated suggestion for classification code can be provided with a confidence level.

An embodiment provides a computer-implemented system and method for providing classification suggestions. A set of uncoded documents is maintained. One of the uncoded documents is selected and compared with a set of reference documents, each associated with a classification. Those reference documents that are similar to the uncoded document are identified. Relationships between the uncoded document and each reference document are identified by counting a number of similar reference documents associated with each different classification. The classification having a highest count of similar reference documents is selected for the selected uncoded document as a suggestion.

Still other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments by way of illustrating the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing a system for displaying relationships between electronically stored information to provide classification suggestions via nearest neighbor, in accordance with one embodiment.

FIG. 2 is a process flow diagram showing a method for displaying relationships between electronically stored information to provide classification suggestions via nearest neighbor, in accordance with one embodiment.

FIG. 3 is a block diagram showing, by way of example, measures for selecting a document reference subset.

FIG. 4 is a process flow diagram showing, by way of example, a method for comparing an uncoded document to reference documents for use in the method of FIG. 2.

FIG. 5 is a screenshot showing, by way of example, a visual display of reference documents in relation to uncoded documents.

FIG. 6 is an alternative visual display of the similar reference documents and uncoded documents.

FIG. 7 is a process flow diagram showing, by way of example, a method for classifying uncoded documents for use in the method of FIG. 2.

DETAILED DESCRIPTION

The ever-increasing volume of ESI underlies the need for automating document review for improved consistency and throughput. Previously coded documents offer knowledge gleaned from earlier work in similar legal projects, as well as a reference point for classifying uncoded ESI.

Providing Suggestions Using Reference Documents

Reference documents are documents that have been previously classified by content and can be used to influence classification of uncoded, that is unclassified, ESI. Specifically, relationships between the uncoded ESI and the reference ESI can be visually depicted to provide suggestions, for instance to a human reviewer, for classifying the visually-proximal uncoded ESI.

Complete ESI review requires a support environment within which classification can be performed. FIG. 1 is a block diagram showing a system 10 for displaying relationships between electronically stored information to provide classification suggestions via nearest neighbor, in accordance with one embodiment. By way of illustration, the system 10 operates in a distributed computing environment, which includes a plurality of heterogeneous systems and ESI sources. Henceforth, a single item of ESI will be referenced as a “document,” although ESI can include other forms of non-document data, as described infra. A backend server 11 is coupled to a storage device 13, which stores documents 14 a, such as uncoded documents, in the form of structured or unstructured data, a database 30 for maintaining information about the documents, and a lookup database 38 for storing many-to-many mappings 39 between documents and document features, such as concepts. The storage device 13 also stores reference documents 14 b, which can provide a training set of trusted and known results for use in guiding ESI classification. The reference documents 14 b are each associated with an assigned classification code and considered as classified or coded. Hereinafter, the terms “classified” and “coded” are used interchangeably with the same intended meaning, unless otherwise indicated. A set of reference documents can be hand-selected or automatically selected through guided review, which is further discussed below. Additionally, the set of reference documents can be predetermined or can be generated dynamically, as the selected uncoded documents are classified and subsequently added to the set of reference documents.

The backend server 11 is coupled to an intranetwork 21 and executes a workbench suite 31 for providing a user interface framework for automated document management, processing, analysis, and classification. In a further embodiment, the backend server 11 can be accessed via an internetwork 22. The workbench software suite 31 includes a document mapper 32 that includes a clustering engine 33, similarity searcher 34, classifier 35, and display generator 36. Other workbench suite modules are possible.

The clustering engine 33 performs efficient document scoring and clustering of documents, including uncoded and coded documents, such as described in commonly-assigned U.S. Pat. No. 7,610,313, the disclosure of which is incorporated by reference. Clusters of uncoded documents 14 a can be formed and organized along vectors, known as spines, based on a similarity of the clusters, which can be expressed in terms of distance. During clustering, groupings of related documents are provided. The content of each document can be converted into a set of tokens, which are word-level or character-level n-grams, raw terms, concepts, or entities. Other tokens are possible. An n-gram is a predetermined number of items selected from a source. The items can include syllables, letters, or words, as well as other items. A raw term is a term that has not been processed or manipulated. Concepts typically include nouns and noun phrases obtained through part-of-speech tagging that have a common semantic meaning. Entities further refine nouns and noun phrases into people, places, and things, such as meetings, animals, relationships, and various other objects. Entities can be extracted using entity extraction techniques known in the field. Clustering of the documents can be based on cluster criteria, such as the similarity of tokens, including n-grams, raw terms, concepts, entities, email addresses, or other metadata.
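By way of illustration only, the conversion of document content into word-level or character-level n-gram tokens can be sketched as follows. The function names `word_ngrams` and `char_ngrams`, the lowercasing, and the whitespace splitting are assumptions made for this sketch and are not drawn from the described embodiment.

```python
def word_ngrams(text, n):
    """Return word-level n-grams: runs of n consecutive words.

    Assumption: words are separated by whitespace and lowercased.
    """
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return character-level n-grams: runs of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(word_ngrams("The merger agreement was signed", 2))
    print(char_ngrams("merger", 3))
```

Either token list could then serve as input to document scoring; other tokens, such as concepts or entities, would require additional processing not shown here.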

In a further embodiment, the clusters can include uncoded and coded documents, which are generated based on a similarity with the uncoded documents, as discussed in commonly-owned U.S. Pat. No. 8,713,018, issued on Apr. 29, 2014, and U.S. Pat. No. 8,515,957, issued Aug. 20, 2013, the disclosures of which are incorporated by reference.

The similarity searcher 34 identifies the reference documents 14 b that are most similar to selected uncoded documents 14 a, clusters, or spines, as further described below with reference to FIG. 4. For example, the uncoded documents, reference documents, clusters, and spines can each be represented by a score vector, which includes paired values consisting of a token, such as a term occurring in that document, cluster or spine, and the associated score for that token. Subsequently, the score vector of the uncoded document, cluster, or spine is compared with the score vectors of the reference documents to identify similar reference documents.
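As a minimal sketch, a score vector of paired tokens and scores can be held in a mapping from token to score. The scoring used here, raw token frequency, is purely an assumption for illustration; the scoring actually applied by the clustering engine is described in the patents incorporated by reference, and the name `score_vector` is hypothetical.

```python
from collections import Counter


def score_vector(tokens):
    """Build a score vector as a dict of token -> score.

    Assumption: the score is the raw count of each token, chosen only
    to keep the sketch simple.
    """
    return dict(Counter(tokens))


if __name__ == "__main__":
    tokens = ["merger", "agreement", "merger", "counsel"]
    print(score_vector(tokens))
```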

The classifier 35 provides a machine-generated suggestion and confidence level for classification of selected uncoded documents 14 a, clusters, or spines, as further described below with reference to FIG. 7. The display generator 36 arranges the clusters and spines in thematic relationships in a two-dimensional visual display space, as further described below beginning with reference to FIG. 5. Once generated, the visual display space is transmitted to a work client 12 by the backend server 11 via the document mapper 32 for presenting to a reviewer on a display 37. The reviewer can include an individual person who is assigned to review and classify one or more uncoded documents by designating a code. Hereinafter, the terms “reviewer” and “custodian” are used interchangeably with the same intended meaning, unless otherwise indicated. Other types of reviewers are possible, including machine-implemented reviewers.

The document mapper 32 operates on uncoded 14 a and coded documents 14 b, which can be retrieved from the storage 13, as well as from a plurality of local and remote sources. The local sources include a local server 15, which is coupled to a storage device 16 with documents 17 and a local client 18, which is coupled to a storage device 19 with documents 20. The local server 15 and local client 18 are interconnected to the backend server 11 and the work client 12 over an intranetwork 21. In addition, the document mapper 32 can identify and retrieve documents from remote sources over an internetwork 22, including the Internet, through a gateway 23 interfaced to the intranetwork 21. The remote sources include a remote server 24, which is coupled to a storage device 25 with documents 26 and a remote client 27, which is coupled to a storage device 28 with documents 29. Other document sources, either local or remote, are possible.

The individual documents 17, 20, 26, 29 include all forms and types of structured and unstructured ESI, including electronic message stores, word processing documents, electronic mail (email) folders, Web pages, and graphical or multimedia data. Notwithstanding, the documents could be in the form of structurally organized data, such as stored in a spreadsheet or database.

In one embodiment, the individual documents 14 a, 14 b, 17, 20, 26, 29 include electronic message folders storing email and attachments, such as maintained by the Outlook and Outlook Express products, licensed by Microsoft Corporation, Redmond, Wash. The database can be an SQL-based relational database, such as the Oracle database management system, Release 8, licensed by Oracle Corporation, Redwood Shores, Calif.

The individual documents 17, 20, 26, 29 can be designated and stored as uncoded documents or reference documents. The uncoded documents, which are unclassified, are selected for a document review project and stored as a document corpus for classification. The reference documents are initially uncoded documents that can be selected from the corpus or other source of uncoded documents, and subsequently classified. The reference documents can assist in providing suggestions for classification of the remaining uncoded documents based on visual relationships between the uncoded documents and reference documents. In a further embodiment, the reference documents can provide classification suggestions for a document corpus associated with a related document review project. In yet a further embodiment, the reference documents can be used as a training set to form machine-generated suggestions for classifying uncoded documents, as further described below with reference to FIG. 7.

The document corpus for a document review project can be divided into subsets of uncoded documents, which are each provided to a particular reviewer as an assignment. To maintain consistency, the same classification codes can be used across all assignments in the document review project. Alternatively, the classification codes can be different for each assignment. The classification codes can be determined using taxonomy generation, during which a list of classification codes can be provided by a reviewer or determined automatically. For purposes of legal discovery, the list of classification codes can include “privileged,” “responsive,” or “non-responsive;” however, other classification codes are possible. A “privileged” document contains information that is protected by a privilege, meaning that the document should not be disclosed or “produced” to an opposing party. Disclosing a “privileged” document can result in an unintentional waiver of the subject matter disclosed. A “responsive” document contains information that is related to a legal matter on which the document review project is based and a “non-responsive” document includes information that is not related to the legal matter.

The system 10 includes individual computer systems, such as the backend server 11, work client 12, server 15, client 18, remote server 24 and remote client 27. The individual computer systems are general purpose, programmed digital computing devices consisting of a central processing unit (CPU), random access memory (RAM), non-volatile secondary storage, such as a hard drive or CD ROM drive, network interfaces, and peripheral devices, including user interfacing means, such as a keyboard and display. The various implementations of the source code and object and byte codes can be held on a computer-readable storage medium, such as a floppy disk, hard drive, digital video disk (DVD), random access memory (RAM), read-only memory (ROM) and similar storage mediums. For example, program code, including software programs, and data are loaded into the RAM for execution and processing by the CPU and results are generated for display, output, transmittal, or storage.

Identifying relationships between the reference documents and uncoded documents includes clustering and similarity measures. FIG. 2 is a process flow diagram showing a method 40 for displaying relationships between electronically stored information to provide classification suggestions via nearest neighbor, in accordance with one embodiment. A set of document clusters is obtained (block 41). In one embodiment, the clusters can include uncoded documents, and in a further embodiment, the clusters can include uncoded and coded documents. The clustered uncoded documents can represent a corpus of uncoded documents for a document review project, or one or more assignments of uncoded documents. The document corpus can include all uncoded documents for a document review project, while each assignment can include a subset of uncoded documents selected from the corpus and assigned to a reviewer. The corpus can be divided into assignments using assignment criteria, such as custodian or source of the uncoded document, content, document type, and date. Other criteria are possible. Prior to, concurrent with, or subsequent to obtaining the cluster set, reference documents are identified (block 42). The reference documents can include all reference documents generated for a document review project, or alternatively, a subset of the reference documents. Obtaining reference documents is further discussed below with reference to FIG. 3.

An uncoded document is selected from one of the clusters in the set and compared against the reference documents (block 43) to identify one or more reference documents that are similar to the selected uncoded document (block 44). The similar reference documents are identified based on a similarity measure calculated between the selected uncoded document and each reference document. Comparing the selected uncoded document with the reference documents is further discussed below with reference to FIG. 4. Once the similar reference documents are identified, relationships between the selected uncoded document and the similar reference documents can be identified (block 45) to provide classification hints, including a suggestion for the selected uncoded document, as further discussed below with reference to FIG. 5. Additionally, machine-generated suggestions for classification can be provided (block 46) with an associated confidence level for use in classifying the selected uncoded document. Machine-generated suggestions are further discussed below with reference to FIG. 7. Once the selected uncoded document is assigned a classification code, either by the reviewer or automatically, the newly classified document can be added to the set of reference documents for use in classifying further uncoded documents. Subsequently, a further uncoded document can be selected for classification using similar reference documents.

In a further embodiment, similar reference documents can also be identified for a selected cluster or a selected spine along which the clusters are placed.

Selecting a Document Reference Subset

After the clusters have been generated, one or more uncoded documents can be selected from at least one of the clusters for comparing with a reference document set or subset. FIG. 3 is a block diagram showing, by way of example, measures 50 for selecting a document reference subset 51. The subset of reference documents 51 can be previously defined 54 and maintained for related document review projects or can be specifically generated for each review project. A predefined reference subset 54 provides knowledge previously obtained during the related document review project to increase efficiency, accuracy, and consistency. Reference subsets newly generated for each review project can include arbitrary 52 or customized 53 reference subsets that are determined automatically or by a human reviewer. An arbitrary reference subset 52 includes reference documents randomly selected for inclusion in the reference subset. A customized reference subset 53 includes reference documents specifically selected for inclusion in the reference subset based on criteria, such as reviewer preference, classification category, document source, content, and review project. Other criteria are possible.
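For illustration, the arbitrary and customized reference subsets can be sketched as random sampling and criteria-based filtering, respectively. The record layout (dictionaries with `code` and `source` fields) and the function names are hypothetical and chosen only for this sketch.

```python
import random


def arbitrary_subset(reference_docs, size, seed=None):
    """Randomly select up to `size` reference documents."""
    rng = random.Random(seed)
    return rng.sample(reference_docs, min(size, len(reference_docs)))


def customized_subset(reference_docs, **criteria):
    """Select reference documents whose fields match every given criterion.

    Assumption: each document is a dict; criteria are exact field matches.
    """
    return [
        doc for doc in reference_docs
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


if __name__ == "__main__":
    docs = [
        {"id": "r1", "code": "privileged", "source": "email"},
        {"id": "r2", "code": "responsive", "source": "email"},
        {"id": "r3", "code": "privileged", "source": "memo"},
    ]
    print(arbitrary_subset(docs, 2, seed=0))
    print(customized_subset(docs, code="privileged"))
```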

The subset of reference documents, whether predetermined or newly generated, should be selected from a set of reference documents that are representative of the document corpus for a review project in which data organization or classification is desired. Guided review assists a reviewer or other user in identifying reference documents that are representative of the corpus for use in classifying uncoded documents. During guided review, the uncoded documents that are dissimilar to all other uncoded documents are identified based on a similarity threshold. In one embodiment, the dissimilarity can be determined as the cos σ of the score vectors for the uncoded documents. Other methods for determining dissimilarity are possible. Identifying the dissimilar documents provides a group of documents that are representative of the corpus for a document review project. Each identified dissimilar document is then classified by assigning a particular classification code based on the content of the document to collectively generate the reference documents. Guided review can be performed by a reviewer, a machine, or a combination of the reviewer and machine.
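The identification of dissimilar uncoded documents during guided review can be sketched as follows, assuming score vectors are held as token-to-score dictionaries and that a document counts as dissimilar when its cos σ with every other uncoded document falls below the similarity threshold. The function names and the threshold value are assumptions for illustration only.

```python
import math


def cosine(s_a, s_b):
    """cos sigma between two sparse score vectors (dict: token -> score)."""
    dot = sum(score * s_b.get(token, 0.0) for token, score in s_a.items())
    norm_a = math.sqrt(sum(v * v for v in s_a.values()))
    norm_b = math.sqrt(sum(v * v for v in s_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dissimilar_documents(vectors, threshold):
    """Return ids of documents whose similarity to every other document
    is below `threshold`. `vectors` maps document id -> score vector."""
    ids = list(vectors)
    return [
        i for i in ids
        if all(cosine(vectors[i], vectors[j]) < threshold
               for j in ids if j != i)
    ]


if __name__ == "__main__":
    uncoded = {
        "d1": {"merger": 1.0},
        "d2": {"merger": 1.0, "price": 0.1},
        "d3": {"lunch": 1.0},
    }
    print(dissimilar_documents(uncoded, threshold=0.5))
```

The pairwise comparison shown is quadratic in the number of documents; a practical system would likely need a more scalable approach.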

Other methods for generating reference documents for a document review project using guided review are possible, including clustering. A set of uncoded documents to be classified is clustered, as described in commonly-assigned U.S. Pat. No. 7,610,313, the disclosure of which is incorporated by reference. A plurality of the clustered uncoded documents are selected based on selection criteria, such as cluster centers or sample clusters. The cluster centers can be used to identify uncoded documents in a cluster that are most similar or dissimilar to the cluster center. The selected uncoded documents are then assigned classification codes. In a further embodiment, sample clusters can be used to generate reference documents by selecting one or more sample clusters based on cluster relation criteria, such as size, content, similarity, or dissimilarity. The uncoded documents in the selected sample clusters are then selected for classification by assigning classification codes. The classified documents represent reference documents for the document review project. The number of reference documents can be determined automatically or by a reviewer. Other methods for selecting documents for use as reference documents are possible.

Comparing a Selected Uncoded Document to Reference Documents

An uncoded document selected from one of the clusters can be compared to the reference documents to identify similar reference documents for use in providing suggestions regarding classification of the selected uncoded document. FIG. 4 is a process flow diagram showing, by way of example, a method 60 for comparing an uncoded document to reference documents for use in the method of FIG. 2. The uncoded document is selected from a cluster (block 61) and applied to the reference documents (block 62). The reference documents can include all reference documents for a document review project or a subset of the reference documents. Each of the reference documents and the selected uncoded document can be represented by a score vector having paired values of tokens occurring within that document and associated token scores. A similarity between the uncoded document and each reference document is determined (block 63) as the cos σ of the score vectors for the uncoded document and reference document being compared and is equivalent to the inner product between the score vectors. In the described embodiment, the cos σ is calculated in accordance with the equation:

$\cos \sigma_{AB} = \frac{\langle \vec{S}_{A} \cdot \vec{S}_{B} \rangle}{\left| \vec{S}_{A} \right| \left| \vec{S}_{B} \right|}$

where $\cos \sigma_{AB}$ comprises a similarity between uncoded document A and reference document B, $\vec{S}_{A}$ comprises a score vector for uncoded document A, and $\vec{S}_{B}$ comprises a score vector for reference document B. Other forms of determining similarity using a distance metric are possible, as would be recognized by one skilled in the art, including using Euclidean distance.
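A minimal sketch of the cos σ calculation over sparse score vectors follows. Representing each score vector as a token-to-score dictionary, and returning zero when either vector is empty, are assumptions of the sketch rather than details of the described embodiment.

```python
import math


def cosine_similarity(s_a, s_b):
    """Compute cos sigma_AB = <S_A . S_B> / (|S_A| |S_B|).

    s_a, s_b: dicts mapping token -> score. Tokens missing from a
    vector are treated as having a score of zero.
    """
    dot = sum(score * s_b.get(token, 0.0) for token, score in s_a.items())
    norm_a = math.sqrt(sum(v * v for v in s_a.values()))
    norm_b = math.sqrt(sum(v * v for v in s_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    uncoded = {"merger": 2.0, "counsel": 1.0}
    reference = {"merger": 1.0, "advice": 1.0}
    print(cosine_similarity(uncoded, reference))
```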

One or more of the reference documents that are most similar to the selected uncoded document, based on the similarity metric, are identified. The most similar reference documents can be identified by satisfying a predetermined threshold of similarity. Other methods for determining the similar reference documents are possible, such as setting a predetermined absolute number of the most similar reference documents. The classification codes of the identified similar reference documents can be used as suggestions for classifying the selected uncoded document, as further described below with reference to FIG. 5. Once identified, the similar reference documents can be used to provide suggestions regarding classification of the selected uncoded document, as further described below with reference to FIGS. 5 and 7.
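The two selection approaches described above, a predetermined similarity threshold and a predetermined absolute number of most similar reference documents, can be sketched together. The function name `select_similar` and its parameters are hypothetical; the input is assumed to be a mapping from reference document identifier to an already computed similarity.

```python
def select_similar(similarities, threshold=None, top_k=None):
    """Return reference document ids ordered from most to least similar.

    similarities: dict mapping reference id -> similarity value.
    threshold:    if given, keep only ids with similarity >= threshold.
    top_k:        if given, keep at most this many of the most similar ids.
    """
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(ref, sim) for ref, sim in ranked if sim >= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [ref for ref, _ in ranked]


if __name__ == "__main__":
    sims = {"r1": 0.9, "r2": 0.4, "r3": 0.75}
    print(select_similar(sims, threshold=0.7))
    print(select_similar(sims, top_k=1))
```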

Displaying the Reference Documents

The similar reference documents can be displayed with the clusters of uncoded documents. In the display, the similar reference documents can be provided as a list, while the clusters can be organized along spines of thematically related clusters, as described in commonly-assigned U.S. Pat. No. 7,271,804, the disclosure of which is incorporated by reference. The spines can be positioned in relation to other cluster spines based on a theme shared by those cluster spines, as described in commonly-assigned U.S. Pat. No. 7,610,313, the disclosure of which is incorporated by reference. Other displays of the clusters and similar reference documents are possible.

Organizing the clusters into spines and groups of cluster spines provides an individual reviewer with a display that presents the documents according to a theme while maximizing the number of relationships depicted between the documents. FIG. 5 is a screenshot 70 showing, by way of example, a visual display 71 of similar reference documents 74 and uncoded documents 73. Clusters 72 of the uncoded documents 73 can be located along a spine, which is a vector, based on a similarity of the uncoded documents 73 in the clusters 72. The uncoded documents 73 are each represented by a smaller circle within the clusters 72.

Similar reference documents 74 identified for a selected uncoded document 73 can be displayed in a list 75 by document title or other identifier. Also, classification codes 76 associated with the similar reference documents 74 can be displayed as circles having a diamond shape within the boundary of the circle. The classification codes 76 can include “privileged,” “responsive,” and “non-responsive” codes, as well as other codes. The different classification codes 76 can each be represented by a color, such as blue for “privileged” reference documents and yellow for “non-responsive” reference documents. Other display representations of the uncoded documents, similar reference documents, and classification codes are possible, including by symbols and shapes.

The classification codes 76 of the similar reference documents 74 can provide suggestions for classifying the selected uncoded document based on factors, such as a number of different classification codes for the similar reference documents and a number of similar reference documents associated with each classification code. For example, the list of reference documents includes four similar reference documents identified for a particular uncoded document. Three of the reference documents are classified as “privileged,” while one is classified as “non-responsive.” In making a decision to assign a classification code to a selected uncoded document, the reviewer can consider classification factors based on the similar reference documents, such as a presence or absence of similar reference documents with different classification codes and a quantity of the similar reference documents for each classification code. Other classification factors are possible. In the current example, the display 71 provides suggestions, including the number of “privileged” similar reference documents, the number of “non-responsive” similar reference documents, and the absence of other classification codes of similar reference documents. Based on the number of “privileged” similar reference documents compared to the number of “non-responsive” similar reference documents, the reviewer may be more inclined to classify the selected uncoded document as “privileged.” Alternatively, the reviewer may wish to further review the selected uncoded document based on the multiple classification codes of the similar reference documents. Other classification codes and combinations of classification codes are possible. The reviewer can utilize the suggestions provided by the similar reference documents to assign a classification to the selected uncoded document. In a further embodiment, the now classified and previously uncoded document can be added to the set of reference documents for use in classifying other uncoded documents.
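The counting of similar reference documents per classification code, as in the four-document example above, can be sketched as follows. The function name `suggest_classification` is hypothetical, and the handling of ties (the code encountered first wins) is an assumption of the sketch; the reviewer retains discretion over the final classification.

```python
from collections import Counter


def suggest_classification(similar_codes):
    """Count similar reference documents per classification code and
    return (suggested_code, counts).

    similar_codes: classification codes of the similar reference
    documents, one entry per document. Returns (None, counts) when empty.
    """
    counts = Counter(similar_codes)
    if not counts:
        return None, counts
    suggestion, _ = counts.most_common(1)[0]
    return suggestion, counts


if __name__ == "__main__":
    codes = ["privileged", "privileged", "non-responsive", "privileged"]
    suggestion, counts = suggest_classification(codes)
    print(suggestion, dict(counts))
```

Returning the full counts alongside the suggestion lets a display show both the quantity of similar reference documents for each code and the absence of other codes.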

In a further embodiment, similar reference documents can be identified for a cluster or spine to provide suggestions for classifying the cluster and spine. For a cluster, the similar reference documents are identified based on a comparison of a score vector for the cluster, which is representative of the cluster center, and the reference document score vectors. Meanwhile, identifying similar reference documents for a spine is based on a comparison between the score vector for the spine, which is based on the cluster center of all the clusters along that spine, and the reference document score vectors. Once identified, the similar reference documents are used for classifying the cluster or spine.
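One possible way to form a score vector representative of a cluster center, and in turn of a spine, is sketched below. Taking the center as the per-token mean of the member score vectors is an assumption made for illustration; the source does not specify how the cluster center is computed, and the name `centroid` is hypothetical.

```python
def centroid(vectors):
    """Return the per-token mean of a list of sparse score vectors.

    Assumption: a token absent from a vector contributes a score of zero.
    """
    if not vectors:
        return {}
    totals = {}
    for vec in vectors:
        for token, score in vec.items():
            totals[token] = totals.get(token, 0.0) + score
    return {token: total / len(vectors) for token, total in totals.items()}


if __name__ == "__main__":
    cluster_a = [{"merger": 2.0}, {"merger": 4.0, "price": 2.0}]
    cluster_b = [{"lunch": 1.0}]
    center_a = centroid(cluster_a)
    center_b = centroid(cluster_b)
    # A spine vector based on the centers of the clusters along the spine.
    spine = centroid([center_a, center_b])
    print(center_a, spine)
```

The resulting cluster or spine vector could then be compared against the reference document score vectors in the same manner as an individual uncoded document.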

In an even further embodiment, the uncoded documents, including the selected uncoded document, and the similar reference documents can be displayed as a document list. FIG. 6 is a screenshot 80 showing, by way of example, an alternative visual display of the similar reference documents 85 and uncoded documents 82. The uncoded documents 82 can be provided as a list in an uncoded document box 81, such as an email inbox. The uncoded documents 82 can be identified and organized using uncoded document factors, such as file name, subject, date, recipient, sender, creator, and classification category 83, if previously assigned.

At least one of the uncoded documents can be selected and displayed in a document viewing box 84. The selected uncoded document can be identified in the list 81 using a selection indicator (not shown), including a symbol, font, or highlighting. Other selection indicators and uncoded document factors are possible. Once identified, the selected uncoded document can be compared to a set of reference documents to identify the reference documents 85 most similar. The identified similar reference documents 85 can be displayed below the document viewing box 84 with an associated classification code 83. The classification code of the similar reference document 85 can be used as a suggestion for classifying the selected uncoded document. After assigning a classification code, a representation 83 of the classification can be provided in the display with the selected uncoded document. In a further embodiment, the now classified and previously uncoded document can be added to the set of reference documents.

Machine Classification of Uncoded Documents

Similar reference documents can be used as suggestions to indicate a need for manual review of the uncoded documents, to indicate when review may be unnecessary, and to provide hints for classifying the uncoded documents, clusters, or spines. Additional information can be generated to assist a reviewer in making classification decisions for the uncoded documents, such as a machine-generated confidence level associated with a suggested classification code, as described in commonly-assigned U.S. Pat. No. 8,635,223, issued Jan. 21, 2014, the disclosure of which is incorporated by reference.

The machine-generated suggestion for classification and associated confidence level can be determined by a classifier. FIG. 7 is a process flow diagram 90 showing, by way of example, a method for classifying uncoded documents by a classifier for use in the method of FIG. 2. An uncoded document is selected from a cluster (block 91) and compared to a neighborhood of x-similar reference documents (block 92) to identify those similar reference documents that are most relevant to the selected uncoded document. The selected uncoded document can be the same as the uncoded document selected for identifying similar reference documents or a different uncoded document. In a further embodiment, a machine-generated suggestion can be provided for a cluster or spine by selecting and comparing the cluster or spine to a neighborhood of x-reference documents for the cluster or spine.

The neighborhood of x-similar reference documents is determined separately for each selected uncoded document and can include one or more similar reference documents. During neighborhood generation, a value for x similar reference documents is first determined automatically or by an individual reviewer. The neighborhood of similar reference documents can include the reference documents, which were identified as similar reference documents according to the method of FIG. 4, or reference documents located in one or more clusters, such as the same cluster as the selected uncoded document or in one or more files, such as an email file. Next, the x-number of similar reference documents nearest to the selected uncoded document are identified. Finally, the identified x-number of similar reference documents are provided as the neighborhood for the selected uncoded document. In a further embodiment, the x-number of similar reference documents are defined for each classification code, rather than across all classification codes. Once generated, the x-number of similar reference documents in the neighborhood and the selected uncoded document are analyzed by the classifier to provide a machine-generated classification suggestion for assigning a classification code (block 93). A confidence level for the machine-generated classification suggestion is also provided (block 94).
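The neighborhood generation described above can be sketched, by way of example only, as follows. The sketch assumes that a distance between the selected uncoded document and each candidate reference document has already been computed; the tuple layout `(doc_id, code, distance)` and the function names are hypothetical.

```python
from collections import defaultdict


def neighborhood(candidates, x):
    """Return the x candidates nearest to the selected uncoded document.

    candidates: list of (doc_id, code, distance) tuples (assumed layout).
    """
    return sorted(candidates, key=lambda c: c[2])[:x]


def neighborhood_per_code(candidates, x):
    """Further embodiment: keep the x nearest candidates for each code."""
    by_code = defaultdict(list)
    for candidate in candidates:
        by_code[candidate[1]].append(candidate)
    return {
        code: sorted(group, key=lambda c: c[2])[:x]
        for code, group in by_code.items()
    }


candidates = [
    ("r1", "privileged", 0.10),
    ("r2", "non-responsive", 0.20),
    ("r3", "privileged", 0.30),
    ("r4", "privileged", 0.40),
]
overall = neighborhood(candidates, x=2)
per_code = neighborhood_per_code(candidates, x=1)
```

Here `overall` holds the two nearest documents across all classification codes, while `per_code` holds the single nearest document for each classification code.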

The machine-generated analysis of the selected uncoded document and x-number of similar reference documents can be based on one or more routines performed by the classifier, such as a nearest neighbor (NN) classifier. The routines for determining a suggested classification code include a minimum distance classification measure (also known as closest neighbor), a minimum average distance classification measure, a maximum count classification measure, and a distance weighted maximum count classification measure. The minimum distance classification measure for a selected uncoded document includes identifying a neighbor that is the closest distance to the selected uncoded document and assigning the classification code of the closest neighbor as the suggested classification code for the selected uncoded document. The closest neighbor is determined by comparing the score vectors for the selected uncoded document with each of the x-number of similar reference documents in the neighborhood as the cos σ to determine a distance metric. The distance metrics for the x-number of similar reference documents are compared to identify the similar reference document closest to the selected uncoded document as the closest neighbor.
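A minimal sketch of the minimum distance classification measure follows, by way of example only. It assumes each score vector is a mapping from term to score and that the distance metric is taken as one minus the cosine of the angle between the two score vectors; that conversion from cosine to distance, and all names used, are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-to-score mappings."""
    dot = sum(score * b.get(term, 0.0) for term, score in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_neighbor(uncoded_vector, neighborhood):
    """Return (code, distance) of the nearest reference document.

    neighborhood: list of (code, score_vector) pairs (assumed layout).
    Distance is assumed to be 1 - cosine similarity.
    """
    best_code, best_distance = None, float("inf")
    for code, vector in neighborhood:
        distance = 1.0 - cosine_similarity(uncoded_vector, vector)
        if distance < best_distance:
            best_code, best_distance = code, distance
    return best_code, best_distance


uncoded = {"merger": 2.0, "counsel": 1.0}
refs = [
    ("privileged", {"merger": 2.0, "counsel": 1.0}),
    ("non-responsive", {"lunch": 1.0}),
]
suggested_code, distance = closest_neighbor(uncoded, refs)
```

In this example the first reference document shares the same terms and scores as the uncoded document, so its classification code becomes the suggestion.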

The minimum average distance classification measure includes calculating an average distance of the similar reference documents for each classification code. The classification code of the similar reference documents having the closest average distance to the selected uncoded document is assigned as the suggested classification code. The maximum count classification measure, also known as the voting classification measure, includes counting a number of similar reference documents for each classification code and assigning a count or “vote” to the similar reference documents based on the assigned classification code. The classification code with the highest number of similar reference documents or “votes” is assigned to the selected uncoded document as the suggested classification code. The distance weighted maximum count classification measure includes identifying a count of all similar reference documents for each classification code and determining a distance between the selected uncoded document and each of the similar reference documents. Each count assigned to the similar reference documents is weighted based on the distance of the similar reference document from the selected uncoded document. The classification code with the highest count, after consideration of the weight, is assigned to the selected uncoded document as the suggested classification code.
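The three measures described above can be sketched, by way of example only, over a neighborhood given as `(code, distance)` pairs. The inverse-distance weight used in the last function is an assumption, since the description does not fix a particular weighting; ties are resolved in favor of the code encountered first. All names are hypothetical.

```python
from collections import defaultdict


def minimum_average_distance(neighbors):
    """Code whose similar reference documents have the lowest mean distance."""
    distances = defaultdict(list)
    for code, dist in neighbors:
        distances[code].append(dist)
    return min(distances, key=lambda c: sum(distances[c]) / len(distances[c]))


def maximum_count(neighbors):
    """Voting: each similar reference document casts one vote for its code."""
    votes = defaultdict(int)
    for code, _ in neighbors:
        votes[code] += 1
    return max(votes, key=votes.get)


def distance_weighted_maximum_count(neighbors, eps=1e-9):
    """Voting with each vote weighted by 1 / distance (assumed weighting)."""
    votes = defaultdict(float)
    for code, dist in neighbors:
        votes[code] += 1.0 / (dist + eps)  # eps avoids division by zero
    return max(votes, key=votes.get)


neighbors = [
    ("privileged", 0.40),
    ("privileged", 0.50),
    ("privileged", 0.60),
    ("non-responsive", 0.05),
]
by_average = minimum_average_distance(neighbors)
by_count = maximum_count(neighbors)
by_weighted = distance_weighted_maximum_count(neighbors)
```

With this neighborhood the measures can disagree: the plain vote favors “privileged” by three to one, while the average-distance and weighted measures favor “non-responsive” because that single reference document is much closer.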

The machine-generated suggested classification code is provided for the selected uncoded document with a confidence level, which can be presented as an absolute value or a percentage. Other confidence level measures are possible. The reviewer can use the suggested classification code and confidence level to assign a classification to the selected uncoded document. Alternatively, the x-NN classifier can automatically assign the suggested classification code. In one embodiment, the x-NN classifier only assigns an uncoded document with the suggested classification code if the confidence level is above a threshold value, which can be set by the reviewer or the x-NN classifier.
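The threshold behavior described above can be sketched as follows, by way of example only. The description does not specify how the confidence level is computed; this sketch assumes, purely for illustration, that confidence is the share of votes held by the winning classification code. The function name and return layout are hypothetical.

```python
from collections import Counter


def suggest_with_confidence(neighbor_codes, threshold):
    """Return (suggested_code, confidence, auto_assign).

    neighbor_codes: classification codes of the documents in the neighborhood.
    Confidence is assumed here to be the winning code's share of the votes.
    """
    if not neighbor_codes:
        return None, 0.0, False
    code, votes = Counter(neighbor_codes).most_common(1)[0]
    confidence = votes / len(neighbor_codes)
    # Automatic assignment only when confidence is above the threshold.
    return code, confidence, confidence > threshold


codes = ["privileged", "privileged", "privileged", "non-responsive"]
lenient = suggest_with_confidence(codes, threshold=0.7)
strict = suggest_with_confidence(codes, threshold=0.8)
```

With three of four votes, the assumed confidence is 0.75, so the suggestion would be assigned automatically under the 0.7 threshold and left for the reviewer under the 0.8 threshold.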

Machine classification can also occur on a cluster or spine level once one or more documents in the cluster have been classified. For instance, for cluster classification, a cluster is selected and a score vector for the center of the cluster is determined as described above with reference to FIG. 4. A neighborhood for the selected cluster can be determined based on a distance metric. The x-number of similar reference documents that are closest to the cluster center can be selected for inclusion in the neighborhood, as described above. Each document in the selected cluster is associated with a score vector from which the cluster center score vector is generated. The distance is then determined by comparing the score vector of the cluster center with the score vector for each of the similar reference documents to determine an x-number of similar reference documents that are closest to the cluster center. However, other methods for generating a neighborhood are possible. Once determined, one of the classification routines is applied to the neighborhood to determine a suggested classification code and confidence level for the selected cluster. The neighborhood of x-number of reference documents is determined for a spine by comparing a spine score vector with the vector for each similar reference document to identify the neighborhood of similar documents that are the most similar.
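By way of example only, a cluster center score vector could be generated from the score vectors of the documents in the cluster as sketched below. The sketch assumes that the center is the term-wise mean of the document score vectors, with a term absent from a document counted as zero; the description above does not fix this computation, and the function name is hypothetical.

```python
from collections import defaultdict


def cluster_center(score_vectors):
    """Term-wise mean of the cluster's score vectors (assumed definition).

    score_vectors: list of term-to-score mappings, one per cluster document.
    A term missing from a document contributes a score of zero.
    """
    if not score_vectors:
        return {}
    totals = defaultdict(float)
    for vector in score_vectors:
        for term, score in vector.items():
            totals[term] += score
    return {term: total / len(score_vectors) for term, total in totals.items()}


center = cluster_center([
    {"merger": 1.0, "counsel": 2.0},
    {"merger": 3.0},
])
```

The resulting center vector can then be compared against each reference document score vector in the same manner as the score vector of a single uncoded document.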

Providing classification suggestions and suggested classification codes has been described in relation to uncoded documents and reference documents. However, in a further embodiment, classification suggestions and suggested classification codes can be provided for the uncoded documents based on a particular token identified within the uncoded documents. The token can include concepts, n-grams, raw terms, and entities. In one example, the uncoded tokens, which are extracted from uncoded documents, can be clustered. A token can be selected from one of the clusters and compared with reference tokens. Relationships between the uncoded token and similar reference tokens can be displayed to provide classification suggestions for the uncoded token. The uncoded documents can then be classified based on the classified tokens.

While the invention has been particularly shown and described as referenced to the embodiments thereof, those skilled in the art will understand that the foregoing and other changes in form and detail may be made therein without departing from the spirit and scope.

What is claimed is:
 1. A computer-implemented system for providing classification suggestions, comprising: a database to maintain a set of uncoded documents; a selection module to select one of the uncoded documents and to compare the selected uncoded document with a set of reference documents each associated with a classification; a similarity module to identify those reference documents that are similar to the uncoded document; an identification module to identify relationships between the uncoded document and each reference document comprising counting a number of similar reference documents associated with each different classification; and a suggestion module to suggest for the selected uncoded document the classification having a highest count of similar reference documents.
 2. A system according to claim 1, further comprising: a vector module to generate a score vector for each uncoded document and each reference document, wherein the score vectors each comprise one or more terms occurring in that document and a score for each term; and a comparison module to determine a similarity value for each reference document and the uncoded document by comparing the score vectors of that reference document to the score vector of the uncoded document.
 3. A system according to claim 2, further comprising: a threshold module to apply a predetermined threshold to the similarity values and to identify those reference documents with similarity values that satisfy the predetermined threshold as the reference documents similar to the uncoded document.
 4. A system according to claim 1, further comprising: a display to display the similar reference documents with the uncoded documents.
 5. A system according to claim 1, further comprising: a display to display the classifications with the similar reference documents.
 6. A system according to claim 5, further comprising: a classification display module to differentiate different types of the classifications via at least one of color, symbol, and shape.
 7. A system according to claim 1, further comprising: a placement module to add the selected uncoded document with the suggested classification to the set of reference documents.
 8. A system according to claim 1, further comprising: a display to display the uncoded documents as a list, wherein at least the selected uncoded document is displayed with the suggested classification.
 9. A system according to claim 8, further comprising: a reference selection module to select a further uncoded document from the set displayed in the list; a similarity display module to display the reference documents similar to the further selected uncoded document; and a classification receipt module to receive for the further uncoded document one of the classifications associated with one or more of the reference documents similar to the further selected uncoded document.
 10. A system according to claim 1, further comprising: a distance determination module to determine a distance between the selected uncoded document and each of the similar reference documents; and a weighting module to weigh the count of similar reference documents associated with each classification based on the distances of the associated similar reference documents.
 11. A computer-implemented method for providing classification suggestions, comprising: maintaining a set of uncoded documents; selecting one of the uncoded documents and comparing the selected uncoded document with a set of reference documents each associated with a classification; identifying those reference documents that are similar to the uncoded document; identifying relationships between the uncoded document and each reference document comprising counting a number of similar reference documents associated with each different classification; and suggesting for the selected uncoded document the classification having a highest count of similar reference documents.
 12. A method according to claim 11, further comprising: generating a score vector for each uncoded document and each reference document, wherein the score vectors each comprise one or more terms occurring in that document and a score for each term; and determining a similarity value for each reference document and the uncoded document by comparing the score vectors of that reference document to the score vector of the uncoded document.
 13. A method according to claim 12, further comprising: applying a predetermined threshold to the similarity values; and identifying those reference documents with similarity values that satisfy the predetermined threshold as the reference documents similar to the uncoded document.
 14. A method according to claim 11, further comprising: displaying the similar reference documents with the uncoded documents.
 15. A method according to claim 11, further comprising: displaying the classifications with the similar reference documents.
 16. A method according to claim 15, further comprising: differentiating different types of the classifications via at least one of color, symbol, and shape.
 17. A method according to claim 11, further comprising: adding the selected uncoded document with the suggested classification to the set of reference documents.
 18. A method according to claim 11, further comprising: displaying the uncoded documents as a list, wherein at least the selected uncoded document is displayed with the suggested classification.
 19. A method according to claim 18, further comprising: selecting a further uncoded document from the set displayed in the list; displaying the reference documents similar to the further selected uncoded document; and receiving for the further selected uncoded document one of the classifications associated with one or more of the reference documents similar to the further selected uncoded document.
 20. A method according to claim 11, further comprising: determining a distance between the selected uncoded document and each of the similar reference documents; and weighing the count of similar reference documents associated with each classification based on the distances of the associated similar reference documents.